CMPINF 2120: Homework 08¶

Hua Ji!!!¶

Assigned: Thursday of Week 08 at 11:00PM¶

DUE: Thursday of Week 09 at 11:59PM¶

This assignment is focused on fitting logistic regression models for binary classification. You will work with a single data set that consists of mixed inputs and a binary outcome. The input names start with the letter x and the output is named y. You do not need to worry about preprocessing the continuous inputs for this assignment, so you can focus on the programming details of fitting logistic regression models with statsmodels.

You will begin by exploring the data using figures appropriate for binary classification tasks. You will then fit multiple models ranging from very simple to quite complex. You will use information criterion metrics to identify the best models, and you will interpret the best models by examining coefficients and visualizing predictions of the event probability.

You may add as many code and markdown cells as you see fit to answer the questions.

Collaborators¶

Type the names of other students you worked with here.

Problem 00¶

Let's import all required modules and functions for this assignment upfront.

0a)¶

Import the "big 4" data analysis modules of NumPy, Pandas, matplotlib pyplot, and Seaborn. You must use their common aliases.

0a) - SOLUTION¶

In [ ]:
import numpy as np 
import pandas as pd
import matplotlib.pyplot as plt 

import seaborn as sns 

0b)¶

You will fit linear models via statsmodels and create feature arrays using dmatrices().

Import the statsmodels formula interface using its common alias. Import the dmatrices() function from the appropriate module.

0b) - SOLUTION¶

In [ ]:
import statsmodels.formula.api as smf
from patsy import dmatrices

Problem 01¶

You will work with a single data set for this assignment. This is a synthetic data set designed to represent common relationships you may encounter when training binary classifiers. The inputs themselves are not "challenging" so that you may focus on the relationships via features derived from the inputs in this application.

That said, it is essential to always explore data before training models! This problem requires you to visually explore the data using figures appropriate for binary classification tasks. However, the problem is open ended: there are no prompts for the specific types of figures you must make! You must decide which figures are critical for visually exploring the output-to-input relationships associated with binary classification problems.

HINT: The Week 08 lecture recordings help!

The synthetic data file name is assigned to the probA_path object in the cell below. The file name assumes that you have downloaded the hw08_probA.csv file into the same working directory as this notebook.

In [ ]:
probA_path = 'hw08_probA.csv'

Read in the data using the appropriate Pandas function and assign the result to the dfA object.

Visually explore the inputs, output, and the input-output relationships using figures appropriate for binary classification problems.

You may add as many cells as you feel are necessary.

1) - SOLUTION¶

In [ ]:
# read in data
dfA = pd.read_csv(probA_path)
In [ ]:
dfA.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 555 entries, 0 to 554
Data columns (total 4 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   x1      555 non-null    float64
 1   x2      555 non-null    float64
 2   x3      555 non-null    object 
 3   y       555 non-null    int64  
dtypes: float64(2), int64(1), object(1)
memory usage: 17.5+ KB
In [ ]:
dfA.nunique()
Out[ ]:
x1    555
x2    555
x3      5
y       2
dtype: int64
In [ ]:
dfA.isna().sum()
Out[ ]:
x1    0
x2    0
x3    0
y     0
dtype: int64
In [ ]:
sns.catplot(dfA, x='y', kind='count', height=4)
plt.show()
(figure: count plot of the binary output y)
In [ ]:
dfA.y.value_counts(normalize=True)
Out[ ]:
y
1    0.526126
0    0.473874
Name: proportion, dtype: float64
In [ ]:
sns.catplot(dfA, x='x3', kind='count')
plt.show()
(figure: count plot of the categorical input x3)

We can see that the output classes are roughly balanced (about 0.53 vs 0.47).

In [ ]:
sns.catplot(dfA, kind='box')
plt.show()
(figure: box plots of the numeric columns)
In [ ]:
sns.displot(dfA, x='x1', kind='hist')
plt.show()
(figure: histogram of x1)
In [ ]:
sns.displot(dfA, x='x2', kind='hist')
plt.show()
(figure: histogram of x2)

The box plots show that the inputs have similar magnitudes, so standardization is not needed. The distributions of x1 and x2 are nearly symmetric, so no transformation is required either.

In [ ]:
# explore the relationship between x1 and x2.
sns.relplot(dfA, x='x1', y='x2', hue='x3')
plt.show()
(figure: scatter plot of x1 vs x2 colored by x3)
In [ ]:
sns.relplot(dfA, x='x1', y='x2', col='x3',col_wrap=2, hue='y')
plt.show()
(figure: x1 vs x2 scatter plots faceted by x3 and colored by y)
In [ ]:
sns.lmplot(dfA, x='x1', y='y', logistic=True)
plt.show()
(figure: logistic trend of y versus x1)
In [ ]:
sns.lmplot(dfA, x='x2', y='y', logistic=True)
plt.show()
(figure: logistic trend of y versus x2)

Problem 02¶

Now that you have explored the data, it is time to consider the types of features that you will include in your logistic regression models. The examples in the lecture recordings jumped straight to fitting the models, but you will first generate the design matrices (input feature arrays) before fitting the models. You will therefore know ahead of time how many features will be included in each model!

2a)¶

You will consider many models in this assignment, ranging from simple to complex. You will create the output and input feature arrays for each model using dmatrices() to identify the number of features associated with each model. You will fit 11 models following the guidelines discussed in the lecture recordings. You must therefore generate 11 sets of output and input feature arrays in this problem. All output and input feature arrays must be returned as Pandas DataFrames (thus "array" is a misnomer relative to the data type). All objects must use the following naming convention:

  • The output feature array must be named yA_df_XX where XX is the model ID designation
  • The input feature array must be named XA_df_XX where XX is the model ID designation

The 11 models will include different features derived from the inputs. The 11 models you will consider are:

  • model 00: INTERCEPT-ONLY model
  • model 01: Only categorical input
  • model 02: Only continuous inputs - linear additive features
  • model 03: All inputs - linear additive features
  • model 04: Interact the categorical input with linear additive features from the continuous inputs
  • model 05: Add the categorical input to main effect and pairwise products (interactions) between continuous inputs
  • model 06: Interact the categorical input with main effect and pairwise products between continuous inputs
  • model 07: Add the categorical input to linear and quadratic features derived from all continuous inputs
  • model 08: Interact the categorical input with linear and quadratic features derived from all continuous inputs
  • model 09: Interact the categorical input with main effects, pairwise products, and quadratic features derived from all continuous inputs
  • model 10: Interact linear and quadratic features between the continuous inputs and interact all continuous features with the categorical input

IMPORTANT: Make sure the main effects are ALSO included in models with interactions!

You may add as many cells as you feel are necessary.
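As a reminder of how patsy expands formula operators before you build the design matrices: the `*` operator automatically includes the main effects alongside the interaction. A minimal sketch on a toy frame with hypothetical values (not the assignment's data):

```python
import pandas as pd
from patsy import dmatrices

# Patsy expands a*b into a + b + a:b, so main effects are kept.
toy = pd.DataFrame({'y':  [0, 1, 0, 1],
                    'x1': [0.1, 0.2, 0.3, 0.4],
                    'x3': ['A', 'B', 'A', 'B']})

_, X = dmatrices('y ~ x3*x1', data=toy, return_type='dataframe')
print(list(X.columns))
# ['Intercept', 'x3[T.B]', 'x1', 'x3[T.B]:x1']
```

Similarly, `(x1 + x2)**2` expands to `x1 + x2 + x1:x2`, which is why the pairwise-interaction models below still contain the main effects.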

2a) - SOLUTION¶

In [ ]:
# INTERCEPT-ONLY model
yA_df_00, XA_df_00 = dmatrices('y ~ 1', data=dfA, return_type='dataframe')
In [ ]:
yA_df_00
Out[ ]:
y
0 0.0
1 0.0
2 0.0
3 1.0
4 1.0
... ...
550 1.0
551 1.0
552 1.0
553 1.0
554 1.0

555 rows × 1 columns

In [ ]:
XA_df_00
Out[ ]:
Intercept
0 1.0
1 1.0
2 1.0
3 1.0
4 1.0
... ...
550 1.0
551 1.0
552 1.0
553 1.0
554 1.0

555 rows × 1 columns

In [ ]:
# Only categorical input
yA_df_01, XA_df_01 = dmatrices('y ~ x3', data=dfA, return_type='dataframe')
In [ ]:
XA_df_01
Out[ ]:
Intercept x3[T.B] x3[T.C] x3[T.D] x3[T.E]
0 1.0 0.0 0.0 0.0 0.0
1 1.0 0.0 0.0 0.0 0.0
2 1.0 0.0 0.0 0.0 0.0
3 1.0 0.0 0.0 0.0 0.0
4 1.0 0.0 0.0 0.0 0.0
... ... ... ... ... ...
550 1.0 0.0 0.0 0.0 1.0
551 1.0 0.0 0.0 0.0 1.0
552 1.0 0.0 0.0 0.0 1.0
553 1.0 0.0 0.0 0.0 1.0
554 1.0 0.0 0.0 0.0 1.0

555 rows × 5 columns

In [ ]:
# Only continuous inputs - linear additive features  
yA_df_02, XA_df_02 = dmatrices('y ~ x1 + x2', data=dfA, return_type='dataframe')
In [ ]:
XA_df_02
Out[ ]:
Intercept x1 x2
0 1.0 -0.120768 -0.305824
1 1.0 1.042928 -2.007914
2 1.0 -2.639345 -0.221434
3 1.0 0.542346 2.671225
4 1.0 3.124682 0.956447
... ... ... ...
550 1.0 -1.337680 -0.441454
551 1.0 -1.848733 1.113199
552 1.0 -0.217562 0.432311
553 1.0 -2.698854 -0.852384
554 1.0 -0.542255 2.512838

555 rows × 3 columns

In [ ]:
# All inputs - linear additive features  
yA_df_03, XA_df_03 = dmatrices('y ~ x1 + x2 + x3', data=dfA, return_type='dataframe')
In [ ]:
XA_df_03
Out[ ]:
Intercept x3[T.B] x3[T.C] x3[T.D] x3[T.E] x1 x2
0 1.0 0.0 0.0 0.0 0.0 -0.120768 -0.305824
1 1.0 0.0 0.0 0.0 0.0 1.042928 -2.007914
2 1.0 0.0 0.0 0.0 0.0 -2.639345 -0.221434
3 1.0 0.0 0.0 0.0 0.0 0.542346 2.671225
4 1.0 0.0 0.0 0.0 0.0 3.124682 0.956447
... ... ... ... ... ... ... ...
550 1.0 0.0 0.0 0.0 1.0 -1.337680 -0.441454
551 1.0 0.0 0.0 0.0 1.0 -1.848733 1.113199
552 1.0 0.0 0.0 0.0 1.0 -0.217562 0.432311
553 1.0 0.0 0.0 0.0 1.0 -2.698854 -0.852384
554 1.0 0.0 0.0 0.0 1.0 -0.542255 2.512838

555 rows × 7 columns

In [ ]:
# Interact the categorical input with linear additive features from the continuous inputs
yA_df_04, XA_df_04 = dmatrices('y ~ x3*(x1 + x2)', data=dfA, return_type='dataframe')
In [ ]:
XA_df_04
Out[ ]:
Intercept x3[T.B] x3[T.C] x3[T.D] x3[T.E] x1 x3[T.B]:x1 x3[T.C]:x1 x3[T.D]:x1 x3[T.E]:x1 x2 x3[T.B]:x2 x3[T.C]:x2 x3[T.D]:x2 x3[T.E]:x2
0 1.0 0.0 0.0 0.0 0.0 -0.120768 -0.0 -0.0 -0.0 -0.000000 -0.305824 -0.0 -0.0 -0.0 -0.000000
1 1.0 0.0 0.0 0.0 0.0 1.042928 0.0 0.0 0.0 0.000000 -2.007914 -0.0 -0.0 -0.0 -0.000000
2 1.0 0.0 0.0 0.0 0.0 -2.639345 -0.0 -0.0 -0.0 -0.000000 -0.221434 -0.0 -0.0 -0.0 -0.000000
3 1.0 0.0 0.0 0.0 0.0 0.542346 0.0 0.0 0.0 0.000000 2.671225 0.0 0.0 0.0 0.000000
4 1.0 0.0 0.0 0.0 0.0 3.124682 0.0 0.0 0.0 0.000000 0.956447 0.0 0.0 0.0 0.000000
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
550 1.0 0.0 0.0 0.0 1.0 -1.337680 -0.0 -0.0 -0.0 -1.337680 -0.441454 -0.0 -0.0 -0.0 -0.441454
551 1.0 0.0 0.0 0.0 1.0 -1.848733 -0.0 -0.0 -0.0 -1.848733 1.113199 0.0 0.0 0.0 1.113199
552 1.0 0.0 0.0 0.0 1.0 -0.217562 -0.0 -0.0 -0.0 -0.217562 0.432311 0.0 0.0 0.0 0.432311
553 1.0 0.0 0.0 0.0 1.0 -2.698854 -0.0 -0.0 -0.0 -2.698854 -0.852384 -0.0 -0.0 -0.0 -0.852384
554 1.0 0.0 0.0 0.0 1.0 -0.542255 -0.0 -0.0 -0.0 -0.542255 2.512838 0.0 0.0 0.0 2.512838

555 rows × 15 columns

In [ ]:
# Add the categorical input to main effect and pairwise products (interactions) between continuous inputs  
yA_df_05, XA_df_05 = dmatrices('y ~ x3 + (x1 + x2)**2', data=dfA, return_type='dataframe')
In [ ]:
XA_df_05
Out[ ]:
Intercept x3[T.B] x3[T.C] x3[T.D] x3[T.E] x1 x2 x1:x2
0 1.0 0.0 0.0 0.0 0.0 -0.120768 -0.305824 0.036934
1 1.0 0.0 0.0 0.0 0.0 1.042928 -2.007914 -2.094110
2 1.0 0.0 0.0 0.0 0.0 -2.639345 -0.221434 0.584441
3 1.0 0.0 0.0 0.0 0.0 0.542346 2.671225 1.448729
4 1.0 0.0 0.0 0.0 0.0 3.124682 0.956447 2.988591
... ... ... ... ... ... ... ... ...
550 1.0 0.0 0.0 0.0 1.0 -1.337680 -0.441454 0.590524
551 1.0 0.0 0.0 0.0 1.0 -1.848733 1.113199 -2.058007
552 1.0 0.0 0.0 0.0 1.0 -0.217562 0.432311 -0.094055
553 1.0 0.0 0.0 0.0 1.0 -2.698854 -0.852384 2.300459
554 1.0 0.0 0.0 0.0 1.0 -0.542255 2.512838 -1.362600

555 rows × 8 columns

In [ ]:
# Interact the categorical input with main effect and pairwise products between continuous inputs
yA_df_06, XA_df_06 = dmatrices('y ~ x3*((x1 + x2)**2)', data=dfA, return_type='dataframe')
In [ ]:
XA_df_06
Out[ ]:
Intercept x3[T.B] x3[T.C] x3[T.D] x3[T.E] x1 x3[T.B]:x1 x3[T.C]:x1 x3[T.D]:x1 x3[T.E]:x1 x2 x3[T.B]:x2 x3[T.C]:x2 x3[T.D]:x2 x3[T.E]:x2 x1:x2 x3[T.B]:x1:x2 x3[T.C]:x1:x2 x3[T.D]:x1:x2 x3[T.E]:x1:x2
0 1.0 0.0 0.0 0.0 0.0 -0.120768 -0.0 -0.0 -0.0 -0.000000 -0.305824 -0.0 -0.0 -0.0 -0.000000 0.036934 0.0 0.0 0.0 0.000000
1 1.0 0.0 0.0 0.0 0.0 1.042928 0.0 0.0 0.0 0.000000 -2.007914 -0.0 -0.0 -0.0 -0.000000 -2.094110 -0.0 -0.0 -0.0 -0.000000
2 1.0 0.0 0.0 0.0 0.0 -2.639345 -0.0 -0.0 -0.0 -0.000000 -0.221434 -0.0 -0.0 -0.0 -0.000000 0.584441 0.0 0.0 0.0 0.000000
3 1.0 0.0 0.0 0.0 0.0 0.542346 0.0 0.0 0.0 0.000000 2.671225 0.0 0.0 0.0 0.000000 1.448729 0.0 0.0 0.0 0.000000
4 1.0 0.0 0.0 0.0 0.0 3.124682 0.0 0.0 0.0 0.000000 0.956447 0.0 0.0 0.0 0.000000 2.988591 0.0 0.0 0.0 0.000000
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
550 1.0 0.0 0.0 0.0 1.0 -1.337680 -0.0 -0.0 -0.0 -1.337680 -0.441454 -0.0 -0.0 -0.0 -0.441454 0.590524 0.0 0.0 0.0 0.590524
551 1.0 0.0 0.0 0.0 1.0 -1.848733 -0.0 -0.0 -0.0 -1.848733 1.113199 0.0 0.0 0.0 1.113199 -2.058007 -0.0 -0.0 -0.0 -2.058007
552 1.0 0.0 0.0 0.0 1.0 -0.217562 -0.0 -0.0 -0.0 -0.217562 0.432311 0.0 0.0 0.0 0.432311 -0.094055 -0.0 -0.0 -0.0 -0.094055
553 1.0 0.0 0.0 0.0 1.0 -2.698854 -0.0 -0.0 -0.0 -2.698854 -0.852384 -0.0 -0.0 -0.0 -0.852384 2.300459 0.0 0.0 0.0 2.300459
554 1.0 0.0 0.0 0.0 1.0 -0.542255 -0.0 -0.0 -0.0 -0.542255 2.512838 0.0 0.0 0.0 2.512838 -1.362600 -0.0 -0.0 -0.0 -1.362600

555 rows × 20 columns

In [ ]:
# Add the categorical input to linear and quadratic features derived from all continuous inputs  
yA_df_07, XA_df_07 = dmatrices('y ~ x3 + x1 + x2 + np.power(x1, 2) + np.power(x2, 2)', data=dfA, return_type='dataframe')
In [ ]:
XA_df_07
Out[ ]:
Intercept x3[T.B] x3[T.C] x3[T.D] x3[T.E] x1 x2 np.power(x1, 2) np.power(x2, 2)
0 1.0 0.0 0.0 0.0 0.0 -0.120768 -0.305824 0.014585 0.093528
1 1.0 0.0 0.0 0.0 0.0 1.042928 -2.007914 1.087699 4.031717
2 1.0 0.0 0.0 0.0 0.0 -2.639345 -0.221434 6.966141 0.049033
3 1.0 0.0 0.0 0.0 0.0 0.542346 2.671225 0.294140 7.135445
4 1.0 0.0 0.0 0.0 0.0 3.124682 0.956447 9.763638 0.914790
... ... ... ... ... ... ... ... ... ...
550 1.0 0.0 0.0 0.0 1.0 -1.337680 -0.441454 1.789388 0.194881
551 1.0 0.0 0.0 0.0 1.0 -1.848733 1.113199 3.417813 1.239211
552 1.0 0.0 0.0 0.0 1.0 -0.217562 0.432311 0.047333 0.186893
553 1.0 0.0 0.0 0.0 1.0 -2.698854 -0.852384 7.283813 0.726558
554 1.0 0.0 0.0 0.0 1.0 -0.542255 2.512838 0.294041 6.314356

555 rows × 9 columns

In [ ]:
# Interact the categorical input with linear and quadratic features derived from all continuous inputs 
yA_df_08, XA_df_08 = dmatrices('y ~ x3*(x1 + x2 + np.power(x1, 2) + np.power(x2, 2))', data=dfA, return_type='dataframe')
In [ ]:
XA_df_08
Out[ ]:
Intercept x3[T.B] x3[T.C] x3[T.D] x3[T.E] x1 x3[T.B]:x1 x3[T.C]:x1 x3[T.D]:x1 x3[T.E]:x1 ... np.power(x1, 2) x3[T.B]:np.power(x1, 2) x3[T.C]:np.power(x1, 2) x3[T.D]:np.power(x1, 2) x3[T.E]:np.power(x1, 2) np.power(x2, 2) x3[T.B]:np.power(x2, 2) x3[T.C]:np.power(x2, 2) x3[T.D]:np.power(x2, 2) x3[T.E]:np.power(x2, 2)
0 1.0 0.0 0.0 0.0 0.0 -0.120768 -0.0 -0.0 -0.0 -0.000000 ... 0.014585 0.0 0.0 0.0 0.000000 0.093528 0.0 0.0 0.0 0.000000
1 1.0 0.0 0.0 0.0 0.0 1.042928 0.0 0.0 0.0 0.000000 ... 1.087699 0.0 0.0 0.0 0.000000 4.031717 0.0 0.0 0.0 0.000000
2 1.0 0.0 0.0 0.0 0.0 -2.639345 -0.0 -0.0 -0.0 -0.000000 ... 6.966141 0.0 0.0 0.0 0.000000 0.049033 0.0 0.0 0.0 0.000000
3 1.0 0.0 0.0 0.0 0.0 0.542346 0.0 0.0 0.0 0.000000 ... 0.294140 0.0 0.0 0.0 0.000000 7.135445 0.0 0.0 0.0 0.000000
4 1.0 0.0 0.0 0.0 0.0 3.124682 0.0 0.0 0.0 0.000000 ... 9.763638 0.0 0.0 0.0 0.000000 0.914790 0.0 0.0 0.0 0.000000
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
550 1.0 0.0 0.0 0.0 1.0 -1.337680 -0.0 -0.0 -0.0 -1.337680 ... 1.789388 0.0 0.0 0.0 1.789388 0.194881 0.0 0.0 0.0 0.194881
551 1.0 0.0 0.0 0.0 1.0 -1.848733 -0.0 -0.0 -0.0 -1.848733 ... 3.417813 0.0 0.0 0.0 3.417813 1.239211 0.0 0.0 0.0 1.239211
552 1.0 0.0 0.0 0.0 1.0 -0.217562 -0.0 -0.0 -0.0 -0.217562 ... 0.047333 0.0 0.0 0.0 0.047333 0.186893 0.0 0.0 0.0 0.186893
553 1.0 0.0 0.0 0.0 1.0 -2.698854 -0.0 -0.0 -0.0 -2.698854 ... 7.283813 0.0 0.0 0.0 7.283813 0.726558 0.0 0.0 0.0 0.726558
554 1.0 0.0 0.0 0.0 1.0 -0.542255 -0.0 -0.0 -0.0 -0.542255 ... 0.294041 0.0 0.0 0.0 0.294041 6.314356 0.0 0.0 0.0 6.314356

555 rows × 25 columns

In [ ]:
# Interact the categorical input with main effects, pairwise products, and quadratic features derived from all continuous inputs  
yA_df_09, XA_df_09 = dmatrices('y ~ x3*((x1 + x2)**2 + np.power(x1, 2) + np.power(x2, 2))', data=dfA, return_type='dataframe')
In [ ]:
XA_df_09
Out[ ]:
Intercept x3[T.B] x3[T.C] x3[T.D] x3[T.E] x1 x3[T.B]:x1 x3[T.C]:x1 x3[T.D]:x1 x3[T.E]:x1 ... np.power(x1, 2) x3[T.B]:np.power(x1, 2) x3[T.C]:np.power(x1, 2) x3[T.D]:np.power(x1, 2) x3[T.E]:np.power(x1, 2) np.power(x2, 2) x3[T.B]:np.power(x2, 2) x3[T.C]:np.power(x2, 2) x3[T.D]:np.power(x2, 2) x3[T.E]:np.power(x2, 2)
0 1.0 0.0 0.0 0.0 0.0 -0.120768 -0.0 -0.0 -0.0 -0.000000 ... 0.014585 0.0 0.0 0.0 0.000000 0.093528 0.0 0.0 0.0 0.000000
1 1.0 0.0 0.0 0.0 0.0 1.042928 0.0 0.0 0.0 0.000000 ... 1.087699 0.0 0.0 0.0 0.000000 4.031717 0.0 0.0 0.0 0.000000
2 1.0 0.0 0.0 0.0 0.0 -2.639345 -0.0 -0.0 -0.0 -0.000000 ... 6.966141 0.0 0.0 0.0 0.000000 0.049033 0.0 0.0 0.0 0.000000
3 1.0 0.0 0.0 0.0 0.0 0.542346 0.0 0.0 0.0 0.000000 ... 0.294140 0.0 0.0 0.0 0.000000 7.135445 0.0 0.0 0.0 0.000000
4 1.0 0.0 0.0 0.0 0.0 3.124682 0.0 0.0 0.0 0.000000 ... 9.763638 0.0 0.0 0.0 0.000000 0.914790 0.0 0.0 0.0 0.000000
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
550 1.0 0.0 0.0 0.0 1.0 -1.337680 -0.0 -0.0 -0.0 -1.337680 ... 1.789388 0.0 0.0 0.0 1.789388 0.194881 0.0 0.0 0.0 0.194881
551 1.0 0.0 0.0 0.0 1.0 -1.848733 -0.0 -0.0 -0.0 -1.848733 ... 3.417813 0.0 0.0 0.0 3.417813 1.239211 0.0 0.0 0.0 1.239211
552 1.0 0.0 0.0 0.0 1.0 -0.217562 -0.0 -0.0 -0.0 -0.217562 ... 0.047333 0.0 0.0 0.0 0.047333 0.186893 0.0 0.0 0.0 0.186893
553 1.0 0.0 0.0 0.0 1.0 -2.698854 -0.0 -0.0 -0.0 -2.698854 ... 7.283813 0.0 0.0 0.0 7.283813 0.726558 0.0 0.0 0.0 0.726558
554 1.0 0.0 0.0 0.0 1.0 -0.542255 -0.0 -0.0 -0.0 -0.542255 ... 0.294041 0.0 0.0 0.0 0.294041 6.314356 0.0 0.0 0.0 6.314356

555 rows × 30 columns

In [ ]:
# Interact linear and quadratic features between the continuous inputs and interact all continuous features with the categorical input 
yA_df_10, XA_df_10 = dmatrices('y ~ ((x1 + np.power(x1, 2))*(x2 + np.power(x2, 2)))*x3', data=dfA, return_type='dataframe')
In [ ]:
XA_df_10
Out[ ]:
Intercept x3[T.B] x3[T.C] x3[T.D] x3[T.E] x1 x1:x3[T.B] x1:x3[T.C] x1:x3[T.D] x1:x3[T.E] ... np.power(x1, 2):x2 np.power(x1, 2):x2:x3[T.B] np.power(x1, 2):x2:x3[T.C] np.power(x1, 2):x2:x3[T.D] np.power(x1, 2):x2:x3[T.E] np.power(x1, 2):np.power(x2, 2) np.power(x1, 2):np.power(x2, 2):x3[T.B] np.power(x1, 2):np.power(x2, 2):x3[T.C] np.power(x1, 2):np.power(x2, 2):x3[T.D] np.power(x1, 2):np.power(x2, 2):x3[T.E]
0 1.0 0.0 0.0 0.0 0.0 -0.120768 -0.0 -0.0 -0.0 -0.000000 ... -0.004460 -0.0 -0.0 -0.0 -0.000000 0.001364 0.0 0.0 0.0 0.000000
1 1.0 0.0 0.0 0.0 0.0 1.042928 0.0 0.0 0.0 0.000000 ... -2.184006 -0.0 -0.0 -0.0 -0.000000 4.385296 0.0 0.0 0.0 0.000000
2 1.0 0.0 0.0 0.0 0.0 -2.639345 -0.0 -0.0 -0.0 -0.000000 ... -1.542542 -0.0 -0.0 -0.0 -0.000000 0.341572 0.0 0.0 0.0 0.000000
3 1.0 0.0 0.0 0.0 0.0 0.542346 0.0 0.0 0.0 0.000000 ... 0.785713 0.0 0.0 0.0 0.000000 2.098817 0.0 0.0 0.0 0.000000
4 1.0 0.0 0.0 0.0 0.0 3.124682 0.0 0.0 0.0 0.000000 ... 9.338398 0.0 0.0 0.0 0.000000 8.931679 0.0 0.0 0.0 0.000000
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
550 1.0 0.0 0.0 0.0 1.0 -1.337680 -0.0 -0.0 -0.0 -1.337680 ... -0.789932 -0.0 -0.0 -0.0 -0.789932 0.348718 0.0 0.0 0.0 0.348718
551 1.0 0.0 0.0 0.0 1.0 -1.848733 -0.0 -0.0 -0.0 -1.848733 ... 3.804705 0.0 0.0 0.0 3.804705 4.235393 0.0 0.0 0.0 4.235393
552 1.0 0.0 0.0 0.0 1.0 -0.217562 -0.0 -0.0 -0.0 -0.217562 ... 0.020463 0.0 0.0 0.0 0.020463 0.008846 0.0 0.0 0.0 0.008846
553 1.0 0.0 0.0 0.0 1.0 -2.698854 -0.0 -0.0 -0.0 -2.698854 ... -6.208603 -0.0 -0.0 -0.0 -6.208603 5.292111 0.0 0.0 0.0 5.292111
554 1.0 0.0 0.0 0.0 1.0 -0.542255 -0.0 -0.0 -0.0 -0.542255 ... 0.738877 0.0 0.0 0.0 0.738877 1.856678 0.0 0.0 0.0 1.856678

555 rows × 45 columns

2b)¶

Which of the 11 models has the LEAST number of features?

Which of the 11 models has the MOST number of features?

Type your response in a Markdown cell below.

2b) - SOLUTION¶

In [ ]:
X_matrices = [XA_df_00, XA_df_01, XA_df_02, XA_df_03, XA_df_04, XA_df_05,
              XA_df_06, XA_df_07, XA_df_08, XA_df_09, XA_df_10]
[X.shape[1] for X in X_matrices]
Out[ ]:
[1, 5, 3, 7, 15, 8, 20, 9, 25, 30, 45]

The intercept-only model (model 00) has the LEAST number of features (just the intercept), and model 10 (XA_df_10) has the MOST number of features with 45.

2c)¶

Which of the 11 models will have the FEWEST regression coefficients to estimate?

Which of the 11 models will have the MOST regression coefficients to estimate?

Type your response in a Markdown cell below.

2c) - SOLUTION¶

In [ ]:
X_matrices = [XA_df_00, XA_df_01, XA_df_02, XA_df_03, XA_df_04, XA_df_05,
              XA_df_06, XA_df_07, XA_df_08, XA_df_09, XA_df_10]
[X.shape[1] for X in X_matrices]
Out[ ]:
[1, 5, 3, 7, 15, 8, 20, 9, 25, 30, 45]

The intercept-only model (model 00) has the FEWEST regression coefficients to estimate, since each feature (including the intercept) has one coefficient. Model 10 has the MOST coefficients to estimate, with 45.

2d)¶

This week's lecture examples concluded by showing how Information Criterion metrics like AIC and BIC can be used to identify the BEST model on the training set alone while guarding against overfitting. This is accomplished by penalizing the training set model performance.

Which of the 11 models will have the SMALLEST penalty according to AIC?

Which of the 11 models will have the GREATEST penalty according to AIC?

Type your response in a Markdown cell below.

2d) - SOLUTION¶

Since

$$\mathrm{AIC} = 2 \times k - 2 \times \log\left( \mathrm{Likelihood} \right)$$

where $k$ is the number of estimated coefficients, the more features in the model, the more unknowns, and the greater the penalty. Model 00 has the fewest features, so it will have the SMALLEST penalty, and model 10 has the most features, so it will have the GREATEST penalty according to AIC.
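The formula can be made concrete by computing AIC by hand. A minimal sketch (toy data, not the assignment's data set) for an intercept-only logistic regression, whose maximum likelihood estimate of the event probability is simply the observed event proportion:

```python
import numpy as np

# AIC = 2k - 2*log-likelihood, illustrated for an intercept-only
# logistic regression fit to a toy binary outcome.
y = np.array([1, 0, 1, 1, 0, 1, 0, 1])   # toy binary outcome
p_hat = y.mean()                          # MLE of the event probability
llf = np.sum(y * np.log(p_hat) + (1 - y) * np.log(1 - p_hat))
k = 1                                     # one coefficient: the intercept
aic = 2 * k - 2 * llf
print(aic)
```

The same value would be reported by a fitted statsmodels intercept-only model's `.aic` attribute on this toy data.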

2e)¶

This week's lecture examples concluded by showing how Information Criterion metrics like AIC and BIC can be used to identify the BEST model on the training set alone while guarding against overfitting. This is accomplished by penalizing the training set model performance.

Which of the 11 models will have the SMALLEST penalty according to BIC?

Which of the 11 models will have the GREATEST penalty according to BIC?

Type your response in a Markdown cell below.

2e) - SOLUTION¶

$$\mathrm{BIC} = k \times \log\left(N\right) - 2 \times \log\left( \mathrm{Likelihood} \right) $$

where $N$ is the number of observations and $k$ is the number of unknowns. The BIC penalty, $k \times \log(N)$, therefore depends on both the number of unknowns and the number of observations. Since $N = 555$ is the same for every model, the penalty again grows with the number of features: model 00 has the fewest unknowns and thus the SMALLEST penalty, while model 10 has the most unknowns and thus the GREATEST penalty according to BIC.
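As a quick numeric check (a sketch using the feature counts found in Problem 2a), the BIC penalty is stricter than the AIC penalty whenever $\log(N) > 2$:

```python
import numpy as np

# Compare the AIC penalty (2k) with the BIC penalty (k*log(N)) for the
# 11 candidate model sizes. With N = 555 observations, log(555) ~ 6.32,
# so BIC penalizes each coefficient roughly three times as hard as AIC.
n_obs = 555
num_coef = np.array([1, 5, 3, 7, 15, 8, 20, 9, 25, 30, 45])

aic_penalty = 2 * num_coef
bic_penalty = num_coef * np.log(n_obs)
print(bic_penalty / aic_penalty)  # same ratio (~3.16) for every model
```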

Problem 03¶

It is now time to fit the 11 logistic regression models! You must fit all models using the statsmodels formula interface. After fitting the models you will identify the best models using AIC and BIC.

3a)¶

The fitted models must be assigned to objects that follow a specific naming convention. The model objects must be named fitA_XX where XX is the model ID designation. Each model uses a different set of features and thus a different formula. The formulas were described in the previous problem, but they are repeated below for your convenience.

  • model 00: INTERCEPT-ONLY model
  • model 01: Only categorical input
  • model 02: Only continuous inputs - linear additive features
  • model 03: All inputs - linear additive features
  • model 04: Interact the categorical input with linear additive features from the continuous inputs
  • model 05: Add the categorical input to main effect and pairwise products (interactions) between continuous inputs
  • model 06: Interact the categorical input with main effect and pairwise products between continuous inputs
  • model 07: Add the categorical input to linear and quadratic features derived from all continuous inputs
  • model 08: Interact the categorical input with linear and quadratic features derived from all continuous inputs
  • model 09: Interact the categorical input with main effects, pairwise products, and quadratic features derived from all continuous inputs
  • model 10: Interact linear and quadratic features between the continuous inputs and interact all continuous features with the categorical input

Fit all 11 models and assign the fitted model objects to the appropriate object names.

You may add as many cells as you feel are necessary.

3a) - SOLUTION¶

In [ ]:
fitA_00=smf.logit('y~1',data=dfA).fit()
Optimization terminated successfully.
         Current function value: 0.691781
         Iterations 3
In [ ]:
fitA_00.params
Out[ ]:
Intercept    0.1046
dtype: float64
In [ ]:
fitA_01=smf.logit(formula='y~x3',data=dfA).fit()
Optimization terminated successfully.
         Current function value: 0.687414
         Iterations 4
In [ ]:
fitA_01.params
Out[ ]:
Intercept   -5.406722e-02
x3[T.B]     -5.036176e-17
x3[T.C]      5.123748e-01
x3[T.D]      1.803609e-01
x3[T.E]      1.081344e-01
dtype: float64
In [ ]:
fitA_02=smf.logit(formula='y~x1+x2',data=dfA).fit()
Optimization terminated successfully.
         Current function value: 0.663425
         Iterations 5
In [ ]:
fitA_03=smf.logit(formula='y~x1+x2+x3',data=dfA).fit()
Optimization terminated successfully.
         Current function value: 0.656500
         Iterations 5
In [ ]:
fitA_04=smf.logit(formula='y~x3*(x1+x2)',data=dfA).fit()
Optimization terminated successfully.
         Current function value: 0.569756
         Iterations 6
In [ ]:
fitA_05=smf.logit(formula='y~x3+(x1+x2)**2',data=dfA).fit()
Optimization terminated successfully.
         Current function value: 0.655574
         Iterations 5
In [ ]:
fitA_06=smf.logit(formula='y~x3*((x1+x2)**2)',data=dfA).fit()
Optimization terminated successfully.
         Current function value: 0.564722
         Iterations 7
In [ ]:
fitA_07=smf.logit(formula='y~x3+x1+x2+np.power(x1,2)+np.power(x2,2)',data=dfA).fit()
Optimization terminated successfully.
         Current function value: 0.648063
         Iterations 5
In [ ]:
fitA_08=smf.logit(formula='y~x3*(x1+x2+np.power(x1,2)+np.power(x2,2))',data=dfA).fit()
Optimization terminated successfully.
         Current function value: 0.531693
         Iterations 7
In [ ]:
fitA_09=smf.logit(formula='y~x3*((x1+x2)**2+np.power(x1,2)+np.power(x2,2))',data=dfA).fit()
Optimization terminated successfully.
         Current function value: 0.525994
         Iterations 7
In [ ]:
fitA_10=smf.logit(formula='y~((x1+np.power(x1,2))*(x2+np.power(x2,2)))*x3',data=dfA).fit()
Optimization terminated successfully.
         Current function value: 0.514823
         Iterations 9

3b)¶

That is a lot of models! But which model is the best? We do NOT want to identify an overfit model! Therefore, we must use information criterion metrics to try and guard against overfitting!

Which model is considered to be the BEST according to AIC?

3b) - SOLUTION¶

In [ ]:
model_list=[fitA_00,fitA_01,fitA_02,fitA_03,fitA_04,fitA_05,fitA_06,fitA_07,fitA_08,fitA_09,fitA_10]
In [ ]:
model_results_df = pd.DataFrame({'model_name': ["fitA-%02d" %d for d in range(0,11)],
                                 'num_coef': [model.params.size for model in model_list],
                                 'LogLik': [model.llf for model in model_list],
                                 'AIC': [model.aic for model in model_list],
                                 'BIC': [model.bic for model in model_list]})
In [ ]:
model_results_df.sort_values(by='AIC')
Out[ ]:
model_name num_coef LogLik AIC BIC
8 fitA-08 25 -295.089801 640.179601 748.153804
9 fitA-09 30 -291.926924 643.853847 773.422891
10 fitA-10 45 -285.726813 661.453626 855.807191
4 fitA-04 15 -316.214501 662.429003 727.213525
6 fitA-06 20 -313.420726 666.841453 753.220815
7 fitA-07 9 -359.674991 737.349981 776.220694
2 fitA-02 3 -368.200684 742.401368 755.358272
3 fitA-03 7 -364.357366 742.714732 772.947509
5 fitA-05 8 -363.843567 743.687133 778.238878
0 fitA-00 1 -383.938682 769.877365 774.196333
1 fitA-01 5 -381.514568 773.029136 794.623976

According to AIC, model fitA_08 is the best model.

3c)¶

Which model is considered to be the BEST according to BIC?

3c) - SOLUTION¶

In [ ]:
model_results_df.sort_values(by='BIC')
Out[ ]:
model_name num_coef LogLik AIC BIC
4 fitA-04 15 -316.214501 662.429003 727.213525
8 fitA-08 25 -295.089801 640.179601 748.153804
6 fitA-06 20 -313.420726 666.841453 753.220815
2 fitA-02 3 -368.200684 742.401368 755.358272
3 fitA-03 7 -364.357366 742.714732 772.947509
9 fitA-09 30 -291.926924 643.853847 773.422891
0 fitA-00 1 -383.938682 769.877365 774.196333
7 fitA-07 9 -359.674991 737.349981 776.220694
5 fitA-05 8 -363.843567 743.687133 778.238878
1 fitA-01 5 -381.514568 773.029136 794.623976
10 fitA-10 45 -285.726813 661.453626 855.807191

According to BIC, fitA_04 is the best model.

3d)¶

Let's interpret the best models to gain insights on how the inputs influence the event probability.

How many features are statistically significant for the AIC identified best model?

Which statistically significant feature has the highest MAGNITUDE coefficient for the AIC identified best model?

3d) - SOLUTION¶

In [ ]:
def my_coefplot(mod, figsize_use=(10, 4)):
    fig, ax = plt.subplots(figsize=figsize_use)

    # point estimates with +/- 2 standard error bars
    ax.errorbar(y=mod.params.index,
                x=mod.params,
                xerr=2 * mod.bse,
                fmt='o', color='k', ecolor='k', elinewidth=2, ms=10)

    # reference line at zero: intervals crossing it are not significant
    ax.axvline(x=0, linestyle='--', color='grey', linewidth=3.5)

    ax.set_xlabel('coefficient value')

    plt.show()
In [ ]:
my_coefplot(fitA_08)
(figure: coefficient plot with ±2 standard error bars for fitA_08)

Eight coefficients are statistically significant for the AIC identified best model. The x3[T.E]:x1 and x3[T.D]:x1 features have the highest magnitude coefficients for the AIC identified best model.
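The count can also be checked programmatically rather than read off the plot: a fitted statsmodels result exposes `pvalues`, and counting the entries below 0.05 gives the number of significant coefficients. A minimal sketch with hypothetical p-values standing in for `fitA_08.pvalues`:

```python
import pandas as pd

# Hypothetical p-values standing in for fitA_08.pvalues; the same
# filtering works on any fitted statsmodels result.
pvalues = pd.Series({'Intercept': 0.001, 'x1': 0.72, 'x2': 0.03})
signif = pvalues[pvalues < 0.05]
print(len(signif))         # number of significant coefficients
print(list(signif.index))  # which coefficients they are
```

With the notebook's model, `(fitA_08.pvalues < 0.05).sum()` gives the count directly.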

3e)¶

How many features are statistically significant for the BIC identified best model?

Which statistically significant feature has the highest MAGNITUDE coefficient for the BIC identified best model?

3e) - SOLUTION¶

In [ ]:
my_coefplot(fitA_04)
(figure: coefficient plot for fitA_04)

Eight features are statistically significant for the BIC identified best model as well.

The x3[T.E]:x1 and x3[T.D]:x1 features have the highest MAGNITUDE coefficients for the BIC identified best model.

Problem 04¶

Let's conclude this assignment by visualizing predictions from the best models. You will predict the event probability for the AIC and BIC identified best models to aid the interpretations and conclusions drawn from examining the coefficients.

4a)¶

However, you cannot make predictions without first defining an appropriate visualization grid! All inputs must be included in the new data set. You cannot ignore or forget about any of the inputs! However, that does not mean all inputs must use the same number of unique values! This application involves 2 continuous inputs and 1 categorical input. The grid will consist of many unique values of the continuous inputs and each unique value of the categorical input. The grid will therefore consider combinations not observed in the training data!

Create a DataFrame, input_A_viz, that consists of three columns. The column names must be the same as the input names in the dfA DataFrame. The continuous inputs must include 101 evenly spaced values between their training set minimums and maximums. All categories associated with the categorical input must be included in the grid.

Display the number of unique values associated with each column in input_A_viz to confirm it is created successfully.

4a) - SOLUTION¶

In [ ]:
input_A_viz = pd.DataFrame([(x1, x2, x3) for x1 in np.linspace(dfA.x1.min(), dfA.x1.max(), num=101)
                                         for x2 in np.linspace(dfA.x2.min(), dfA.x2.max(), num=101)
                                         for x3 in dfA.x3.unique()],
                           columns=['x1', 'x2', 'x3'])
In [ ]:
input_A_viz.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 51005 entries, 0 to 51004
Data columns (total 3 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   x1      51005 non-null  float64
 1   x2      51005 non-null  float64
 2   x3      51005 non-null  object 
dtypes: float64(2), object(1)
memory usage: 1.2+ MB
In [ ]:
input_A_viz.nunique()
Out[ ]:
x1    101
x2    101
x3      5
dtype: int64
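The nested list comprehension above works; an equivalent way to build the same grid is a chained cross join, which reads more easily as more inputs are added. This is a sketch using a small synthetic stand-in for dfA so the cell runs on its own (the real grid would of course use the actual dfA):

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for dfA, only so this sketch is self-contained.
rng = np.random.default_rng(0)
dfA = pd.DataFrame({
    'x1': rng.normal(size=200),
    'x2': rng.normal(size=200),
    'x3': rng.choice(list('ABCDE'), size=200),
})

# One small DataFrame per input, then cross join them:
# 101 * 101 * 5 = 51005 rows, matching the list-comprehension grid.
g1 = pd.DataFrame({'x1': np.linspace(dfA.x1.min(), dfA.x1.max(), num=101)})
g2 = pd.DataFrame({'x2': np.linspace(dfA.x2.min(), dfA.x2.max(), num=101)})
g3 = pd.DataFrame({'x3': dfA.x3.unique()})

input_A_viz_alt = g1.merge(g2, how='cross').merge(g3, how='cross')
print(input_A_viz_alt.shape)
```

Note that `how='cross'` requires pandas 1.2 or newer; on older versions the same result can be produced by merging on a constant dummy key.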

4b)¶

Let's visualize the input grid before making predictions to confirm it has been set up correctly.

Create a scatter plot in Seaborn where the x semantic variable is associated with the x1 input and the y semantic variable is associated with the x2 input. The column facets must be associated with the x3 input.

4b) - SOLUTION¶

In [ ]:
sns.relplot(data=input_A_viz, x='x1', y='x2', col='x3', col_wrap=2)
plt.show()
(figure: scatter plot of the x1-x2 visualization grid, column facets by x3)

4c)¶

You now have everything ready to make predictions!

Create a deep copy of input_A_viz and assign the copy to the df_A_viz object.

Create 2 new columns in df_A_viz. One of the columns must be named pred_prob_AIC and the other must be named pred_prob_BIC. Assign the predictions from the AIC identified best model to the pred_prob_AIC column. Assign the predictions from the BIC identified model to the pred_prob_BIC column.

Display the .head() of the df_A_viz object to the screen.

4c) - SOLUTION¶

In [ ]:
df_A_viz = input_A_viz.copy()
In [ ]:
df_A_viz['pred_prob_AIC'] = fitA_08.predict(df_A_viz)
In [ ]:
df_A_viz['pred_prob_BIC'] = fitA_04.predict(df_A_viz)
In [ ]:
df_A_viz.head()
Out[ ]:
x1 x2 x3 pred_prob_AIC pred_prob_BIC
0 -5.270025 -4.622544 A 0.002187 0.000134
1 -5.270025 -4.622544 B 0.105997 0.046565
2 -5.270025 -4.622544 C 1.000000 0.521364
3 -5.270025 -4.622544 D 0.798110 0.976944
4 -5.270025 -4.622544 E 0.018166 0.436589
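For reference, the values in pred_prob_AIC and pred_prob_BIC are event probabilities: a fitted statsmodels logistic regression's .predict() applies the inverse logit to the linear predictor. A minimal self-contained check on a small synthetic data set (the names df and fit below are hypothetical, not dfA or the fitted models above):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Small synthetic binary-outcome data set for illustration only.
rng = np.random.default_rng(1)
df = pd.DataFrame({'x': rng.normal(size=300)})
df['y'] = (rng.random(size=300) < 1 / (1 + np.exp(-(0.5 + 1.5 * df.x)))).astype(int)

fit = smf.logit('y ~ x', data=df).fit(disp=0)

# .predict() returns the inverse logit of the linear predictor
eta = fit.params['Intercept'] + fit.params['x'] * df.x
manual_prob = 1 / (1 + np.exp(-eta))
print(np.allclose(fit.predict(df), manual_prob))  # → True
```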

4d)¶

You will now visualize the predictions using a scatter plot organized consistently with 4b). The difference, though, is that the marker color will be associated with a model predicted event probability!

Visualize the predictions on the input grid associated with the AIC identified best model by associating the hue semantic variable with pred_prob_AIC. You must use a diverging color palette in the Seaborn scatter plot.

4d) - SOLUTION¶

In [ ]:
sns.relplot(data=df_A_viz, x='x1', y='x2', col='x3', hue='pred_prob_AIC', palette='coolwarm', col_wrap=2)
plt.show()
(figure: predicted event probability for the AIC identified best model on the input grid, column facets by x3)

4e)¶

Visualize the predictions on the input grid associated with the BIC identified best model by associating the hue semantic variable with pred_prob_BIC. You must use a diverging color palette in the Seaborn scatter plot.

4e) - SOLUTION¶

In [ ]:
sns.relplot(data=df_A_viz, x='x1', y='x2', col='x3', hue='pred_prob_BIC', palette='coolwarm', col_wrap=2)
plt.show()
(figure: predicted event probability for the BIC identified best model on the input grid, column facets by x3)

4f)¶

Describe the differences in the predicted probability surfaces between the AIC and BIC identified best models.

Type your response in a markdown cell below.

4f) - SOLUTION¶

The differences in the predicted probability surfaces between the AIC and BIC identified best models show up mainly in the x3 = C facet: the AIC model predicts very high event probability near the center of that facet, while the BIC model shows almost no region of high probability there.

4g)¶

Why are the BIC and AIC identified best models different? Or to put it another way, how is BIC different from AIC?

Type your response in a markdown cell below.

4g) - SOLUTION¶

The AIC and BIC identified best models differ because the two metrics penalize model complexity differently. AIC depends only on the log-likelihood and the number of coefficients, while BIC depends on the log-likelihood, the number of coefficients, AND the number of rows. Because BIC's penalty per coefficient grows with the sample size, it punishes complex models more heavily, which is why it selects the simpler fitA_04 over fitA_08.
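The penalties can be checked directly against the table in 3b)/3c). With k coefficients, log-likelihood LL, and n training rows, AIC = 2k - 2·LL and BIC = k·ln(n) - 2·LL. A short sketch, assuming n = 555 rows (a value inferred here from the AIC/BIC gap in the table, not shown in this section):

```python
import numpy as np

def aic(loglik, k):
    # AIC penalizes each coefficient by a constant 2, regardless of sample size
    return 2 * k - 2 * loglik

def bic(loglik, k, n):
    # BIC's penalty per coefficient grows with the number of rows n
    return np.log(n) * k - 2 * loglik

# fitA-04 values from the table above; n = 555 is an inferred assumption
n = 555
print(aic(-316.214501, 15))     # close to the table's 662.429003
print(bic(-316.214501, 15, n))  # close to the table's 727.213525
```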

BONUS¶

Reshape the predictions in df_A_viz to support showing the predictions from BOTH models in a single figure. Create the figure such that the column facets must still be associated with the categorical input. The model must be associated with the row facets.

BONUS - SOLUTION¶

In [ ]:
lf_viz = df_A_viz.reset_index().rename(columns={'index': 'rowid'}).\
    melt(id_vars=['rowid', 'x1', 'x2', 'x3'], value_vars=['pred_prob_AIC', 'pred_prob_BIC'])
In [ ]:
lf_viz
Out[ ]:
rowid x1 x2 x3 variable value
0 0 -5.270025 -4.622544 A pred_prob_AIC 0.002187
1 1 -5.270025 -4.622544 B pred_prob_AIC 0.105997
2 2 -5.270025 -4.622544 C pred_prob_AIC 1.000000
3 3 -5.270025 -4.622544 D pred_prob_AIC 0.798110
4 4 -5.270025 -4.622544 E pred_prob_AIC 0.018166
... ... ... ... ... ... ...
102005 51000 4.796841 4.687443 A pred_prob_BIC 0.999768
102006 51001 4.796841 4.687443 B pred_prob_BIC 0.944738
102007 51002 4.796841 4.687443 C pred_prob_BIC 0.690248
102008 51003 4.796841 4.687443 D pred_prob_BIC 0.030825
102009 51004 4.796841 4.687443 E pred_prob_BIC 0.637584

102010 rows × 6 columns

In [ ]:
sns.relplot(data=lf_viz, x='x1', y='x2', col='x3', row='variable', hue='value')
plt.show()
(figure: predicted probabilities from both models, row facets by model, column facets by x3)